-
Notifications
You must be signed in to change notification settings - Fork 258
fix(gke-hyperdisk-test): Replace the tensor flow jobs with FIO jobs #4683
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(gke-hyperdisk-test): Replace the tensor flow jobs with FIO jobs #4683
Conversation
dd19029 to
11b9cc6
Compare
5eca840 to
b6a98ca
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point to the RCA on how pinning the dependency version succeeds the test. The test should not be dependent on a specific version of transformer as its not the core part of the blueprint to test.
Moreover cloud docs for running tensorflow examples are not pinning the dependencies.
The test uses these python libraries in a job it executes. The job (a python code) breaks due to change in the internal behavior of these libraries. Pinning them ensure the code behaves in a consistent manner. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failures in test should not be propagated as changes to blueprints. The test needs to be fixed to accomodate to this breaking change. Customers when using these models for their workloads will not be using the same version that is pinned here and will find discrepancies
As per our discussion offline, will work on replacing the tensorflow jobs with FIO jobs. |
5983077 to
60e5181
Compare
61351ab to
ed694da
Compare
413f662 to
0daf86c
Compare
ba47760 to
e40b125
Compare
5c579e7 to
e892a94
Compare
…ample blueprint Replaces the tensorflow jobs with FIO jobs in the `gke-managed-hyperdisk` example and accordingly updates its correspoding test playbook.
0ee1ef2 to
2f7f378
Compare
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request is a great improvement, replacing the flaky TensorFlow tests with more stable and focused FIO benchmarks. The implementation looks solid. I have a few suggestions to improve security, reproducibility, and robustness of the test scripts.
In gke-managed-hyperdisk.yaml, I've pointed out a security concern with running the container as root and suggested pinning the ubuntu image version for reproducibility. I also recommended adding a trap to the bash script to ensure cleanup happens even if a benchmark fails.
In test-gke-managed-hyperdisk.yml, I've suggested a more robust way to wait for the job completion and a cleanup of the logging tasks to avoid redundancy.
tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-gke-managed-hyperdisk.yml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-gke-managed-hyperdisk.yml
Outdated
Show resolved
Hide resolved
3405b56
into
GoogleCloudPlatform:develop
Rationale for the Change
The original implementation used TensorFlow jobs to test the GKE Hyperdisk volumes. However, the test has been failing continuously due to changes in dependency module versions (e.g., transformers and datasets).
Pinning dependency versions was deemed an inadequate fix, as the test should not rely on a specific version of a non-core component like the TensorFlow example.
Implementation Details
Following an offline discussion, the approach has been changed to replace the problematic TensorFlow jobs with FIO (Flexible I/O Tester) jobs. FIO is a standard benchmarking and stress tool used to measure I/O performance.
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.